Reinforcement learning with human feedback (RLHF)

What is reinforcement learning with human feedback (RLHF)?

= fine-tune the LLM using reinforcement learning, with a reward signal derived from human feedback data, so that the resulting model is better aligned with human preferences.

How it works

  • Start from an instruction-fine-tuned LLM and a set of prompts.
  • For each prompt, generate several completions and have human labelers rank them by preference.
  • Train a reward model on these rankings so it can score completions automatically (see the sketch below).
  • Fine-tune the LLM with a reinforcement-learning algorithm (typically PPO), using the reward model's score as the reward signal, and repeat the generate-score-update loop until the model is sufficiently aligned.
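To make the reward-model step concrete, here is a minimal PyTorch sketch. The `RewardModel` class, the toy embedding dimension, and the random tensors standing in for encoded (prompt, completion) pairs are assumptions for illustration, not part of any particular library.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (prompt, completion) embedding to a scalar reward (toy stand-in)."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-ins for embeddings of human-ranked completion pairs:
# `chosen` is the completion the labeler preferred, `rejected` the other one.
chosen = torch.randn(8, 16)
rejected = torch.randn(8, 16)

# Pairwise preference loss: push the reward of the preferred completion
# above the reward of the rejected one.
loss = -torch.nn.functional.logsigmoid(
    reward_model(chosen) - reward_model(rejected)
).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Once trained this way, the reward model replaces the human labelers inside the RL loop: each completion the LLM generates is scored by the reward model, and that score drives the policy update.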

Avoid reward hacking

Reward hacking

  • Reward hacking is a problem in reinforcement learning: the agent learns to cheat the system by favoring actions that maximize the reward received even if those actions don't align well with the original objective.
  • In the context of LLMs, reward hacking can show up as completions padded with words or phrases that inflate the score for the metric being optimized, even though the overall quality of the language degrades.
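A standard way to avoid reward hacking is to keep a frozen copy of the original model as a reference and subtract a penalty, proportional to the divergence between the fine-tuned model and that reference, from the reward. The sketch below assumes toy per-token log-probabilities and an illustrative `beta` coefficient; the numbers are made up for the example.

```python
import torch

beta = 0.1  # strength of the KL penalty (hyperparameter)

# Per-token log-probabilities of the sampled completion under the model being
# fine-tuned (policy) and under the frozen copy of the original model (reference).
policy_logprobs = torch.tensor([-1.2, -0.8, -2.0, -0.5])
reference_logprobs = torch.tensor([-1.0, -1.1, -1.9, -0.7])

# Reward-model score for the whole completion.
reward_score = torch.tensor(3.2)

# Per-token KL estimate: log p_policy(token) - log p_reference(token).
kl_per_token = policy_logprobs - reference_logprobs

# Penalized reward used by the RL update (e.g. PPO): high reward-model scores
# no longer pay off if they are obtained by drifting far from the reference.
penalized_reward = reward_score - beta * kl_per_token.sum()
print(penalized_reward)
```

Because the penalty grows as the fine-tuned model's token probabilities drift away from the reference model, padding completions with reward-inflating words or phrases stops being a winning strategy.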